An assessment of gene prediction accuracy in large DNA sequences.

نویسندگان

  • R Guigó
  • P Agarwal
  • J F Abril
  • M Burset
  • J W Fickett
چکیده

One of the first useful products from the human genome will be a set of predicted genes. Besides its intrinsic scientific interest, the accuracy and completeness of this data set is of considerable importance for human health and medicine. Though progress has been made on computational gene identification in terms of both methods and accuracy evaluation measures, most of the sequence sets in which the programs are tested are short genomic sequences, and there is concern that these accuracy measures may not extrapolate well to larger, more challenging data sets. Given the absence of experimentally verified large genomic data sets, we constructed a semiartificial test set comprising a number of short single-gene genomic sequences with randomly generated intergenic regions. This test set, which should still present an easier problem than real human genomic sequence, mimics the approximately 200kb long BACs being sequenced. In our experiments with these longer genomic sequences, the accuracy of GENSCAN, one of the most accurate ab initio gene prediction programs, dropped significantly, although its sensitivity remained high. Conversely, the accuracy of similarity-based programs, such as GENEWISE, PROCRUSTES, and BLASTX was not affected significantly by the presence of random intergenic sequence, but depended on the strength of the similarity to the protein homolog. As expected, the accuracy dropped if the models were built using more distant homologs, and we were able to quantitatively estimate this decline. However, the specificities of these techniques are still rather good even when the similarity is weak, which is a desirable characteristic for driving expensive follow-up experiments. Our experiments suggest that though gene prediction will improve with every new protein that is discovered and through improvements in the current set of tools, we still have a long way to go before we can decipher the precise exonic structure of every gene in the human genome using purely computational methodology.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Evaluation of First and Second Markov Chains Sensitivity and Specificity as Statistical Approach for Prediction of Sequences of Genes in Virus Double Strand DNA Genomes

Growing amount of information on biological sequences has made application of statistical approaches necessary for modeling and estimation of their functions. In this paper, sensitivity and specificity of the first and second Markov chains for prediction of genes was evaluated using the complete double stranded  DNA virus. There were two approaches for prediction of each Markov Model parameter,...

متن کامل

An Evolutionary and Phylogenetic Study of the BMP15 Gene

DNA sequence data contains a wealth of biologically useful information. Recent innovations in DNA sequencing technology have greatly increased our capacity to determine massive amounts of nucleotide sequences. These sequences can be used to specify the characteristics of different regions, interpret the evolutionary relationships between categorized groups, likelihood of performing multiple com...

متن کامل

An Effective Method for Detecting Y-chromosome Specific Sequences of Circulating Fetal DNA in Maternal Plasma During the First-trimester

Background and Aims: New advances in the use of cell-free fetal DNA (cffDNA) in maternal plasma of pregnant women has provided the possibility of applying cffDNA in prenatal diagnosis as a non-invasive method. One of the applications of prenatal diagnosis is fetal gender determination. Early prenatal determination of fetal sex is required for pregnant women at risk of X-linked and some endocrin...

متن کامل

In silico investigation of lactoferrin protein characterizations for the prediction of anti-microbial properties

Lactoferrin (Lf) is an iron-binding multi-functional glycoprotein which has numerous physiological functions such as iron transportation, anti-microbial activity and immune response. In this study, different in silico approaches were exploited to investigate Lf protein properties in a number of mammalian species. Results showed that the iron-binding site, DNA and RNA-binding sites, signal pepti...

متن کامل

Comparison of Promoter Sequences of Flowering Control Genes, FT1 and Three Versions of VIN3, in Susceptible and Resistant Sugar Beet Genotypes to Bolting

Autumn sowing of sugar beet is a suitable way in sustainable agriculture. Bolting is an undesirable phenomenon which reduces sugar beet yield and it is the most important limiting factor in autumn sowing of sugar beet. Identification and comparison of the sequence of flowering genes in various genotypes can help to understand the molecular mechanisms controlling bolting. In the previous studies...

متن کامل

Clustering of a Number of Genes Affecting in Milk Production using Information Theory and Mutual Information

Information theory is a branch of mathematics. Information theory is used in genetic and bioinformatics analyses and can be used for many analyses related to the biological structures and sequences. Bio-computational grouping of genes facilitates genetic analysis, sequencing and structural-based analyses. In this study, after retrieving gene and exon DNA sequences affecting milk yield in dairy ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • Genome research

دوره 10 10  شماره 

صفحات  -

تاریخ انتشار 2000